Finding plagiarism strings between two given documents is the main task of
the plagiarism detection problem. Traditional approaches based on string
matching are not very useful in cases of semantically similar plagiarism. Deep
learning approaches address this problem by measuring the semantic similarity
between pairs of sentences. However, these approaches still face two
challenges. First, they cannot handle cases where only part of a
sentence belongs to a plagiarism passage. Second, measuring sentence
similarity without considering the context of surrounding sentences
decreases accuracy. To solve these problems, this paper proposes a
two-phase plagiarism detection system based on a multi-layer long short-term
memory network model and a feature extraction technique: (i) a passage-phase
to recognize plagiarism passages, and (ii) a word-phase to determine the exact
plagiarism strings. Our experimental results on the PAN 2014 corpus reached
94.26% F-measure, higher than existing research in this field.

Keywords: Multi-layer long short-term memory; Plagiarism detection

1. INTRODUCTION

Plagiarism is defined as the reuse of another person’s ideas, processes, results, or words without 
explicitly acknowledging the source [1]. Plagiarism detection is the task of automatically retrieving
strings in a suspicious document that were reused from another document. Plagiarism is divided into two
main types, literal and intelligent, based on the plagiarist’s behavior [2]. Literal plagiarism is
the common case in which plagiarists do not spend much time hiding the academic crime they
committed; for example, they copy and paste text from the internet. Intelligent plagiarism is a severe form of
academic dishonesty in which plagiarists try to deceive readers by changing others’ contributions to appear as
their own. Intelligent plagiarists try to hide, obfuscate, and change the original work in various
ways, including text manipulation, translation, and idea adoption.

Over the past two decades, automatic plagiarism detection has received significant attention from 
the research community. Two main tasks of automatic plagiarism detection are source retrieval and text 
alignment. In the source retrieval task, given a suspicious document and a web search engine, the task is to 
retrieve all source documents from which text has been reused. In the text alignment task, given a pair of
documents (a suspicious document and a source one), the task is to identify contiguous maximal-length 
passages of reused text. 
Most existing work on text alignment follows either supervised or unsupervised approaches. Several
unsupervised approaches use character-based methods (e.g., [1], [3], [4]) that apply exact or
approximate string matching, with measures such as the Hamming or Levenshtein distance, to compute the
similarity between two strings within a sliding window. Instead of comparing strings as in character-based
methods, vector-based methods (e.g., [5], [6]) represent input texts as vectors of tokens and
measure the distance between these vectors using measures such as the Jaccard coefficient, cosine
similarity, or the Euclidean or Manhattan distance.

Based on the intuition that similar documents have similar syntactic structures, some
research works (e.g., [7], [8]) used syntactic information in the first stage of measuring sentence similarity.
The main limitation of these unsupervised approaches is that they cannot deal with intelligent plagiarism, in
which the same content can be expressed by different words and in different orders. Research on intelligent
plagiarism (e.g., [9]-[11]) often concentrates on finding the similarity between pairs of sentences. Gharavi et al. [9]
proposed a plagiarism detection method for the Persian language that represents each sentence by a semantic
embedding vector and then compares these vectors using cosine similarity.

Cherroun et al. [10] proposed a two-phase system using a supervised learning approach to detect
plagiarism in Arabic. The first phase produced a representation vector for each sentence by combining different
features, including word embedding, word alignment, term frequency weighting, and part-of-speech tagging.
The second phase used lexical, syntactic, and semantic features in three machine learning models (support
vector machine (SVM), decision trees (DT), and random forests (RF)) to improve the accuracy of the
first-phase results. However, their approach did not deal with obfuscated plagiarism cases in which a passage is
inserted in the middle of a sentence. Altheneyan et al. [11] presented two systems (PlagLinSVM and
PlagRbfSVM) using the SVM classifier with lexical, syntactic, and semantic
features to detect plagiarism sentences. Their approach applied two detection levels: paragraph
and sentence. The paragraph level detects similar paragraphs in the two input documents based on the
number of common unigrams and bigrams in these paragraphs. The sentence level aligns sentences in the
resulting paragraph pairs based on the number of common unigrams between the two sentences. If the
score of a sentence pair is higher than a pre-defined threshold, the SVM classifier is applied to determine
whether the two sentences are similar. Finally, plagiarism passages are created by connecting adjacent
sentences that were copied from the source documents.

Previous intelligent plagiarism approaches are limited in that they find copied paragraphs based on
sentence units, assuming that people only copy or rewrite whole sentences. However, real cases of plagiarism
are more complicated than that. When comparing plagiarism strings with their sources, we found that
they can differ in: (i) the number of sentences; (ii) the sentence length; and (iii) the order in which the
text appears. These situations have not yet been resolved in existing research on plagiarism detection.

Recently, deep learning approaches have proven to be efficient in solving many natural
language processing tasks. However, as far as we know, even the largest training corpus for the plagiarism
detection task is still very small for training a deep model. Therefore, in this paper, we propose a plagiarism
detection system that combines hand-crafted feature vectors with a long short-term memory (LSTM) network
model [12] to deal with the problems mentioned above. The system includes two main phases:

—  passage-phase to identify plagiarism passages in suspicious and source documents.

—  word-phase to remove redundant parts from plagiarism passages to obtain the exact plagiarism
strings.

The main contributions of this work are:

— We propose new features at both the passage and word levels to improve the accuracy of detecting
similar strings between two documents. These features are: (i) maximize passage similarity, maximize
passage intersection, and passage importance at the passage-phase; and (ii) word similarity, average word
similarity, and sentence based similarity at the word-phase.

— We propose a two-phase plagiarism detection system based on a multi-layer LSTM network model
using our proposed features to solve both literal and intelligent plagiarism problems.

The rest of the article is organized as follows: our proposed method is introduced in section 2. In section 3,
we describe our experiments and analyze the results. Finally, our conclusions and future research directions
are presented in section 4.


2. PROPOSED METHOD 
The problem of finding similar strings between two documents is stated as in [13]:
Definition 1: Given two documents $d$ and $d'$, the goal is to detect a set of passage pairs, $P$, such that:

$$P = \{\langle p_d, p_{d'} \rangle \mid \forall p_d, \forall p_{d'} : p_d \in d \wedge p_{d'} \in d' \wedge |p_d \cap p_{d'}| > \delta\} \qquad (1)$$

in which $p_d$ is a string from $d$; $p_{d'}$ is a string from $d'$; $|p_d \cap p_{d'}|$ indicates the similarity between
$p_d$ and $p_{d'}$; and $\delta$ is a threshold that is used to determine whether two strings are similar enough to be
considered as plagiarism.

The series of competition shared tasks for plagiarism detection named plagiarism analysis, 
authorship identification, and near-duplicate detection (PAN) has defined four types of plagiarism. 

a. No obfuscation: plagiarism cases are created by copying a paragraph from the source document and
inserting it into the suspicious one.

b. Random obfuscation: plagiarism cases are created by inserting, deleting, and reordering words in a
paragraph of the source, and inserting the result into the suspicious document.

c. Translation obfuscation: plagiarism cases are created by translating a paragraph more than once through
several languages and back to the original language using different machine translation tools, and then
inserting the translated paragraph into the suspicious document.

d. Summary obfuscation: plagiarism cases are created by summarizing the source paragraph and inserting
the summary into the suspicious document.

This paper aims at solving plagiarism cases belonging to all four types above. Our proposed system’s
workflow, shown in Figure 1, includes three steps.

—  Pre-processing: This step splits input documents into sentences, removes stopwords and special
characters, and combines short sentences into one.

—  Passage-phase: After the pre-processing step, we use a context window sliding over the source and
suspicious documents to create candidate passages. We extract features from these passages and
generate an input feature matrix corresponding to these features. This matrix is fed into a binary
classifier of the candidate selection module to obtain pairs of plagiarism passages.

—  Word-phase: The pairs of plagiarism passages are used as the input for the word-phase. The purpose of 
this phase is to define the exact plagiarism strings from the input passages. A binary classifier at the 
word-level is used to perform this task. 
Figure 1. Overview of the proposed system’s workflow for plagiarism detection


2.1. Pre-processing 

The input documents are split into sentences using the sentence tokenizer from the NLTK library.
Then stopwords are removed from these sentences. Some specific cases can affect the accuracy of plagiarism
selection. These cases are:

— The input documents contain numbers that are written with stray spaces, such as ‘8. 39’ or ‘7 p. m’. In
this case, the sentence splitter incorrectly segments the text into sentences at the dot (‘.’) character.

— After removing stopwords, some short sentences contain no tokens or only one or two tokens.
For example, the two sentences “Can you feel the burn?” and “Who we are?” are reduced to two words and
to an empty string, respectively, after removing stopwords and punctuation characters.

Since the similarity of short sentences carries little meaning, we combine short sentences
with surrounding sentences and compare the similarity between the combined passages. Therefore, to
deal with the problems mentioned above, we first apply the sentence splitter and then remove stopwords,
numbers, and special characters from the sentences. After cleaning the text, sentences with fewer than three
words are combined with the next sentence to create extended sentences. To the best of our knowledge, this
combination step allows us to manage the passage length efficiently after pairing and to avoid
creating overly long passages. We use a window of size w (sentences) sliding over both the suspicious and
source documents to generate candidate plagiarism passages, which are used as inputs of the passage-phase.
The optimal window size for the PAN datasets is three sentences.
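
The pre-processing steps above can be sketched as follows. This is only an illustration under stated assumptions: the stopword list is a tiny hypothetical stand-in for the NLTK resources used in the paper, the function names are ours, and attaching a trailing short sentence to the previous one is our own choice, since the text only describes merging with the next sentence.

```python
import re

# Hypothetical, deliberately tiny stopword list (the paper relies on NLTK).
STOPWORDS = {"a", "an", "the", "is", "are", "we", "you", "can", "who",
             "of", "and", "to", "in"}


def clean(sentence):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]


def merge_short(sentences, min_words=3):
    """Merge cleaned sentences with fewer than min_words tokens into the next one."""
    merged, carry = [], []
    for tokens in map(clean, sentences):
        carry = carry + tokens
        if len(carry) >= min_words:
            merged.append(carry)
            carry = []
    if carry:
        # Assumption: a trailing short sentence is attached to the previous one.
        if merged:
            merged[-1] = merged[-1] + carry
        else:
            merged.append(carry)
    return merged


def sliding_passages(sentences, w=3):
    """Return every window of w consecutive sentences (one shorter passage if too few)."""
    if len(sentences) <= w:
        return [sentences]
    return [sentences[i:i + w] for i in range(len(sentences) - w + 1)]
```

With the two example sentences from the text, `clean` leaves two tokens for the first and none for the second, so both are carried forward into the following sentence.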


2.2. Passage-phase 

The input of this phase is the set of candidate plagiarism passages, each consisting of three
consecutive sentences from the suspicious or source documents. In this phase, each passage is encoded as a
semantic embedding vector. The semantic similarity between two passages is calculated based on the
distance between these vectors. We use SBERT to encode passages, since it was shown in [14] that SBERT
performs better than other methods (e.g., Word2Vec [15], GloVe [16], FastText [17], InferSent [18], or Universal
Sentence Encoder [19]) in various domains. The features representing each passage are derived from these
passage vectors. They are then used as inputs for the binary classification at the passage level to detect
whether two passages are similar or not.


2.2.1. Passage-phase feature extraction 

Given the set of all candidate passages in the suspicious document $U = (u_1, u_2, \ldots, u_n)$ and the set of all
candidate passages in the source document $V = (v_1, v_2, \ldots, v_m)$, each passage $u_i$ and $v_j$ is represented as a
passage embedding vector. We propose the following features for this phase:
— Maximize passage similarity

This feature is the maximum similarity of a passage vector $u_i$ against the set of
passage vectors $V$. Let $\mathrm{sim}_{u_i,v_j}$ be the similarity between two passage vectors $u_i$ and $v_j$, where
$u_i \in U$ and $v_j \in V$. Let $\mathrm{sim}_{u_i,V}$ be the maximum passage similarity of the passage vector $u_i$ against the
set of passage vectors $V$. It is calculated as:

$$\mathrm{sim}_{u_i,V} = \max_{v_j \in V} \operatorname{cosin}(u_i, v_j) \qquad (2)$$


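The maximize passage similarity feature vector of all passage vectors in the pair of suspicious and
source documents is determined by (3):

$$\mathrm{psim}(U,V) = (\mathrm{sim}_{u_1,V}, \mathrm{sim}_{u_2,V}, \ldots, \mathrm{sim}_{u_n,V}, \mathrm{sim}_{v_1,U}, \mathrm{sim}_{v_2,U}, \ldots, \mathrm{sim}_{v_m,U}) \qquad (3)$$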
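
As a minimal sketch of (2) and (3), the feature can be computed from any list of passage vectors. The toy two-dimensional vectors in the usage lines stand in for SBERT embeddings, and the function names are illustrative rather than taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def max_similarity(vec, others):
    """Eq. (2): best cosine similarity of one passage against a set of passages."""
    return max(cosine(vec, o) for o in others)


def psim(U, V):
    """Eq. (3): scores of all u_i against V, followed by all v_j against U."""
    return [max_similarity(u, V) for u in U] + [max_similarity(v, U) for v in V]


# Toy vectors (assumption: real inputs would be passage embeddings).
U = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [1.0, 1.0]]
features = psim(U, V)  # one value per passage, n + m in total
```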


— Maximize passage intersection 

To determine the maximum intersection value of a passage $u_i$ with a set of passages $V$, we split
passages into words, find the words shared by each passage pair $(u_i, v_j)$, with $u_i \in U$ and $v_j \in V$, and take
the maximum size of this intersection. This value is calculated as in (4):

$$\mathrm{inter}_{u_i,V} = \max_{v_j \in V} \operatorname{len}(u_i \cap v_j) \qquad (4)$$


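The maximize passage intersection feature vector of all passages in the pair of suspicious and source
documents is determined by (5):

$$\mathrm{pinter}(U,V) = (\mathrm{inter}_{u_1,V}, \mathrm{inter}_{u_2,V}, \ldots, \mathrm{inter}_{u_n,V}, \mathrm{inter}_{v_1,U}, \mathrm{inter}_{v_2,U}, \ldots, \mathrm{inter}_{v_m,U}) \qquad (5)$$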
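
Equations (4) and (5) can be illustrated with plain token lists. In this sketch we assume the intersection is counted over distinct words, which the text does not state explicitly; the function names are ours.

```python
def max_intersection(passage, others):
    """Eq. (4): largest number of distinct shared words with any passage in others."""
    words = set(passage)
    return max(len(words & set(o)) for o in others)


def pinter(U, V):
    """Eq. (5): intersection scores of all u_i against V, then all v_j against U."""
    return [max_intersection(u, V) for u in U] + [max_intersection(v, U) for v in V]


U = [["voyager", "space", "probe"], ["candy", "bar"]]
V = [["space", "probe", "planets"], ["solar", "system"]]
scores = pinter(U, V)
```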


— Passage importance 

Term frequency-inverse document frequency (TF-IDF) is one of the most widely used term weighting
schemes. It is a numerical statistic that indicates how important a word is to a document in a collection or
corpus, and it is commonly used as a weighting factor in information retrieval and text mining, where
discarding low-weight terms helps to increase retrieval effectiveness. To determine similar passages, we adopt
the idea of term frequency-inverse sentence frequency (TF-ISF) [20]. We treat each passage as a document
and each document as a corpus, then calculate the values of $TF(w,U)$, $TF(u_i,U)$, and $ISF(u_i,U)$, in which
$w$ is a term in a passage $u_i$ and $U$ is the document containing $u_i$. Given that $|u_i|$ is the total number of
words in the passage $u_i$, $TF(u_i,U)$ is computed as:


$$TF(u_i, U) = \frac{\sum_{w \in u_i} TF(w,U)}{|u_i|} \qquad (6)$$

$ISF(u_i,U)$ is computed by (7):

$$ISF(u_i, U) = \frac{\sum_{w \in u_i} IDF(w,U)}{|u_i|} \qquad (7)$$

The passage importance of the passage $u_i$ in the document $U$ is determined by (8):

$$\mathrm{imp}_{u_i,U} = TF(u_i, U) \times ISF(u_i, U) \qquad (8)$$


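The passage importance feature vector of all passages in the pair of suspicious and source documents is
determined by (9):

$$\mathrm{pimp}(U,V) = (\mathrm{imp}_{u_1,U}, \mathrm{imp}_{u_2,U}, \ldots, \mathrm{imp}_{u_n,U}, \mathrm{imp}_{v_1,V}, \mathrm{imp}_{v_2,V}, \ldots, \mathrm{imp}_{v_m,V}) \qquad (9)$$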
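
The passage importance of (6) to (8) might be computed as in the sketch below. The paper does not spell out the word-level definitions, so two assumptions are made here and marked in the code: TF(w,U) is the relative frequency of w in the whole document, and IDF(w,U) is log(N / n_w), with N the number of passages and n_w the number of passages containing w.

```python
import math
from collections import Counter


def passage_importance(passages):
    """Return imp = TF(u_i, U) * ISF(u_i, U) for each tokenised passage of one document."""
    all_words = [w for p in passages for w in p]
    counts = Counter(all_words)
    total = len(all_words)
    n = len(passages)
    # Number of passages that contain each word at least once.
    df = Counter(w for p in passages for w in set(p))
    scores = []
    for p in passages:
        if not p:
            scores.append(0.0)
            continue
        # Assumption: TF(w, U) = count of w in the document / document length.
        tf = sum(counts[w] / total for w in p) / len(p)
        # Assumption: IDF(w, U) = log(N / n_w) over the passages of the document.
        isf = sum(math.log(n / df[w]) for w in p) / len(p)
        scores.append(tf * isf)
    return scores


doc = [["a", "b"], ["a", "c"], ["a", "a"]]
imp = passage_importance(doc)
```

A passage made only of words that occur in every passage gets importance 0 under these assumptions, which matches the intuition that it carries no distinguishing content.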


— The feature matrix for the passage-phase 

After extracting and creating three feature vectors psim(U,V), pinter(U,V), and pimp(U,V), we 
combine them into a two-dimensional matrix of size (n+m) x 3 where n+m is the total number of passages 
from suspicious and source documents. The feature matrix for all passages in the pair of suspicious and 
source documents is determined as in (10). It is used as the input for the multi-layer LSTM network model, 
described in section 2.2.2. 


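$$f_{passage} = \begin{bmatrix} \mathrm{sim}_{u_1,V} & \mathrm{inter}_{u_1,V} & \mathrm{imp}_{u_1,U} \\ \mathrm{sim}_{u_2,V} & \mathrm{inter}_{u_2,V} & \mathrm{imp}_{u_2,U} \\ \vdots & \vdots & \vdots \\ \mathrm{sim}_{v_m,U} & \mathrm{inter}_{v_m,U} & \mathrm{imp}_{v_m,V} \end{bmatrix} \qquad (10)$$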


2.2.2. Plagiarism passage selection 

We build our binary classifier using a multi-layer LSTM network model, which predicts
the probability that each passage in the pair of suspicious and source documents is a plagiarism passage.
Figure 2 shows the structure of our model at the passage-phase. At this phase, we generate the input by
reshaping the feature matrix f_passage into a three-dimensional array of shape (batch_size, time_steps, seq_len)
and feed it into the model. The parameters used in the LSTM model are: (i) batch_size equals the number of
passages; (ii) time_steps equals 1; (iii) seq_len equals the number of features (seq_len=3).
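
To make the input shape concrete, the following sketch stacks three feature vectors into the (n+m) x 3 matrix and reshapes it to (batch_size, 1, 3) using plain nested lists. The LSTM itself is not shown, and the helper names are hypothetical.

```python
def build_feature_matrix(psim, pinter, pimp):
    """Stack three equal-length feature vectors into an (n+m) x 3 matrix."""
    if not (len(psim) == len(pinter) == len(pimp)):
        raise ValueError("feature vectors must have the same length")
    return [list(row) for row in zip(psim, pinter, pimp)]


def to_lstm_input(matrix):
    """Reshape (batch_size, 3) into (batch_size, time_steps=1, seq_len=3)."""
    return [[row] for row in matrix]


f_passage = build_feature_matrix([0.9, 0.2], [5, 1], [0.3, 0.1])
x = to_lstm_input(f_passage)  # two passages, one time step each, three features
```

With time_steps fixed to 1, each passage is presented to the network as a single step, so the recurrence does not link one passage to the next within a sample.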


Figure 2. The architecture of the multi-layer LSTM model at the passage-phase (diagram labels: Input, Feature vector, LSTM layers, Output)


The output of the sigmoid activation function is always in the range (0,1). This function is applied
to the output of all units in the last hidden LSTM layer. Let $y = (y_1, y_2, \ldots, y_{n+m})$ be the output of the binary
classification model ($0 < y_i < 1$), where $n+m$ is the number of passages in the pair of suspicious and source
documents. As Figure 3 shows, the final output of the model is a vector of 0s and 1s, with value 1 for every $y_i$
higher than a threshold $\theta$ and value 0 for the rest.
Figure 3. The output of the model at the passage-phase
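
The thresholding of the model output, and the selection of the longest run of 1s used to form a passage, can be sketched as below. The function names are ours, and keeping the first run when two runs tie is an assumption; in practice the selection would presumably be applied to the suspicious and source parts of the vector separately.

```python
def binarize(y, theta):
    """Map each probability to 1 if it exceeds theta, else 0."""
    return [1 if v > theta else 0 for v in y]


def longest_run_of_ones(bits):
    """Return (start, end) inclusive indices of the longest run of 1s, or None."""
    best = None
    start = None
    for i, b in enumerate(bits + [0]):  # sentinel 0 closes a trailing run
        if b == 1 and start is None:
            start = i
        elif b != 1 and start is not None:
            # Strict comparison keeps the first run on ties (assumption).
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best


y = [0.05, 0.8, 0.9, 0.02, 0.7, 0.6, 0.95, 0.01]
bits = binarize(y, theta=0.1)
span = longest_run_of_ones(bits)
```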


Plagiarism passages are generated by selecting the sentences corresponding to the longest run of 1s
in the output of the model. When analyzing the plagiarism passages obtained, we found that
most of them contain entire sentences. However, a detected plagiarism paragraph may contain several
redundant words at its two ends, as in the following example from the PAN 2014 corpus. In
this example, the underlined text is inside the plagiarism paragraph, whereas the rest is redundant.

The suspicious plagiarism paragraph: 


The capsule was designed for entry into the Martian atmosphere, descent to the surface, 
impact survival, and surface lifetimes of as much as six months and contained the power, guidance, 
control communications, and data handling systems necessary to complete its mission. is perhaps 
the most productive space probe yet deployed, visiting four planets and their moons, including two 
primary visits to previously unexplored planets, with powerful cameras and a multitude of scientific 
instruments, at a fraction of the money later spent on specialized probes such as the and the probe. 
Along with, and Voyager 2 is an .Voyager 2 Galileo spacecraft Cassini-Huygens [2] [3] Pioneer 10 
Pioneer I] Voyager I New Horizons interstellar probe resident per year, or roughly half the cost of one 
candy bar each year since project inception. 


The source plagiarism paragraph: 


Voyager 2 unmanned interplanetary space probe Voyager program Voyager 1 Voyager 2 
ecliptic Solar System Uranus Neptune gravity assist Saturn Voyager 2 Titan Planetary Grand Tour 
[1] is perhaps the most productive space probe yet deployed, visiting four planets and their moons, 
including two primary visits to previously unexplored planets, with powerful cameras and a 
multitude of scientific instruments, at a fraction of the money later spent on specialized probes such 
as the and the probe. Along with, , and Voyager 2 is an .Voyager 2 Galileo spacecraft Cassini- 
Huygens [2] [3] Pioneer 10 Pioneer 11 Voyager I New Horizons interstellar probe Contents Titan 
3E Centaur was originally planned to be, part of the. 


To solve this problem, we extend pairs of plagiarism passages from the suspicious and source 
documents by adding k sentences to the left and right of both passages. Extended passages will be used as the 
input for the word-phase to find exact plagiarism strings. It is done by removing redundant text from the 
extended plagiarism passages. The word-phase will be introduced next. 
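
Extending a detected passage by k sentences on each side amounts to widening an index range and clipping it to the document. A minimal sketch, with a hypothetical helper name and sentence indices assumed to be zero-based and inclusive:

```python
def extend_passage(start, end, k, n_sentences):
    """Widen the inclusive sentence range [start, end] by k on each side, clipped to the document."""
    return max(0, start - k), min(n_sentences - 1, end + k)


extended = extend_passage(4, 6, k=2, n_sentences=10)
```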


2.3. Word-phase 

To remove the redundant text at the two ends of the extended plagiarism passages, we need to 
identify semantically related segments based on consecutive words of high similarity. To get the meaning of 
a word, we put that word in a window size of 3 with one word on the left and one word on the right. The text 
inside this window is used as the input of SBERT to create word feature vectors. 
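
The three-word context window can be illustrated as follows. How the first and last words are handled is not described in the text, so this sketch assumes that a boundary word simply uses the one neighbor it has; the resulting strings are what would be passed to the sentence encoder.

```python
def word_windows(tokens):
    """Return, for each word, the text of its window: left neighbor, word, right neighbor."""
    windows = []
    for i in range(len(tokens)):
        lo = max(0, i - 1)                 # assumption: truncate at the left edge
        hi = min(len(tokens), i + 2)       # assumption: truncate at the right edge
        windows.append(" ".join(tokens[lo:hi]))
    return windows


windows = word_windows(["most", "productive", "space", "probe"])
```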


2.3.1. Word-level feature extraction 

In this phase, three features are proposed based on the cosine similarity between a word and the
sentence containing that word. The word similarity feature is a vector that contains the maximum similarity
value of each word: the maximum similarity of a word in the suspicious passage is its highest similarity
with any word in the source passage, and vice versa. The average word similarity and
sentence based similarity features handle cases where the similarity value of a word differs greatly
from those of the surrounding words. The average word similarity feature is a vector in which each item is
the average of the word similarity values within the sentence containing the word. The sentence based
similarity feature is a vector in which each item is the maximum sentence similarity of the sentence
containing the word. The word-phase features are defined in detail below.

Given the extended suspicious passage $P = (p_1, p_2, \ldots, p_n)$ and the extended source passage
$Q = (q_1, q_2, \ldots, q_m)$, each word $p_i$ and $q_j$ is represented by a word embedding vector.
— Word similarity 

Let $\mathrm{sim}(p_i, q_j)$ be the cosine similarity between two word vectors $p_i$ and $q_j$. The word similarity
feature between $P$ and $Q$ is a vector computed as in (11):

$$\mathrm{wsim}(P,Q) = \left(\max_{q_j \in Q} \mathrm{sim}(p_1, q_j), \max_{q_j \in Q} \mathrm{sim}(p_2, q_j), \ldots, \max_{p_j \in P} \mathrm{sim}(q_m, p_j)\right) \qquad (11)$$


— Average word similarity 

Let $w_i$ (with $i = 1, \ldots, n+m$) be the $i$-th word in the pair of suspicious and source passages, let $d$ be
the sentence containing $w_i$, and let $|d|$ be the total number of words in $d$. Let $\mathrm{avg}(w_i)$ be the average
similarity of word $w_i$ in the sentence $d$, and let $\mathrm{wsim}(k)$ be the value of the $k$-th item in the word
similarity feature vector. Then $\mathrm{avg}(w_i)$ is computed as:

$$\mathrm{avg}(w_i) = \frac{\sum_{w_k \in d} \mathrm{wsim}(k)}{|d|} \qquad (12)$$


The average word similarity feature between two passages $P$ and $Q$ is a vector determined by the following
formula:

$$\mathrm{wavg}(P,Q) = (\mathrm{avg}(p_1), \mathrm{avg}(p_2), \ldots, \mathrm{avg}(p_n), \mathrm{avg}(q_1), \mathrm{avg}(q_2), \ldots, \mathrm{avg}(q_m)) \qquad (13)$$
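
The two word-level features in (11) to (13) can be sketched from a precomputed similarity matrix. Here `sim[i][j]` is assumed to hold the cosine similarity between words p_i and q_j, and `sentence_ids` is a hypothetical list giving the sentence each of the n+m words belongs to.

```python
def word_similarity(sim):
    """Eq. (11): row maxima for the words of P, then column maxima for the words of Q."""
    n, m = len(sim), len(sim[0])
    p_part = [max(row) for row in sim]
    q_part = [max(sim[i][j] for i in range(n)) for j in range(m)]
    return p_part + q_part


def average_word_similarity(wsim, sentence_ids):
    """Eqs. (12)-(13): give each word the mean wsim of the sentence that contains it."""
    totals, counts = {}, {}
    for score, sid in zip(wsim, sentence_ids):
        totals[sid] = totals.get(sid, 0.0) + score
        counts[sid] = counts.get(sid, 0) + 1
    return [totals[sid] / counts[sid] for sid in sentence_ids]


sim = [[0.9, 0.1], [0.2, 0.4], [0.3, 0.8]]   # n = 3 suspicious words, m = 2 source words
wsim = word_similarity(sim)
wavg = average_word_similarity(wsim, ["s0", "s0", "s1", "t0", "t0"])
```

Averaging within a sentence smooths isolated dips, so a single low-scoring word inside an otherwise matching sentence is less likely to break the detected span.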


— Sentence based similarity 

We reuse the maximize passage similarity feature (as described in the passage-phase), treating each
sentence as a passage. Let $U = (u_1, u_2, \ldots, u_k)$ and $V = (v_1, v_2, \ldots, v_s)$ be the sets of sentences in
the suspicious and source passages, respectively. Let $\mathrm{sim\_sent}(p_i)$ be the sentence based similarity of
word $p_i$ in the sentence $u_j$. It is computed as:

$$\mathrm{sim\_sent}(p_i) = \max_{v_l \in V} \operatorname{cosin}(u_j, v_l), \quad \forall p_i \in u_j \qquad (14)$$

The sentence based similarity feature between two passages $P$ and $Q$ is a vector determined by the
following formula:

$$\mathrm{wsent}(P,Q) = (\mathrm{sim\_sent}(p_1), \mathrm{sim\_sent}(p_2), \ldots, \mathrm{sim\_sent}(p_n), \mathrm{sim\_sent}(q_1), \mathrm{sim\_sent}(q_2), \ldots, \mathrm{sim\_sent}(q_m)) \qquad (15)$$


The feature matrix for the word-phase: 
After computing three feature vectors wsim(P,Q), wavg(P,Q), and wsent(P,Q), we combine these 
feature vectors into a two-dimensional matrix of size (n+m) x 3. 


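$$f_{word} = \begin{bmatrix} \max_{q_j \in Q} \mathrm{sim}(p_1, q_j) & \mathrm{avg}(p_1) & \mathrm{sim\_sent}(p_1) \\ \max_{q_j \in Q} \mathrm{sim}(p_2, q_j) & \mathrm{avg}(p_2) & \mathrm{sim\_sent}(p_2) \\ \vdots & \vdots & \vdots \\ \max_{p_j \in P} \mathrm{sim}(q_m, p_j) & \mathrm{avg}(q_m) & \mathrm{sim\_sent}(q_m) \end{bmatrix} \qquad (16)$$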


The feature matrix of all the extended plagiarism passages is determined by (16). This feature matrix is used 
as the input for the multi-layer LSTM model, described in section 2.3.2. 


2.3.2. Plagiarism string selection 

In this section, we conduct two processing steps: (i) select plagiarism sentences and (ii) remove
redundant text. The details of each step are described below.
— Select plagiarism sentences

To select exact plagiarism sentences from the extended plagiarism passages, we use a multi-layer
LSTM model whose input is taken from the feature matrix f_word, as shown in Figure 4. The parameters used
in this model are: (i) batch_size equals the number of words; (ii) time_steps equals 1; (iii) seq_len equals the
number of features (seq_len=3).

In Figure 4, $p_i$ and $q_j$ denote the $i$-th and $j$-th words in the pair of extended plagiarism passages,
$y\_pred = (y_1, y_2, \ldots, y_{n+m})$ is the output of the binary classification model ($0 < y_i < 1$), and $n+m$ is the
total number of words in the pair of these passages. The predicted mean value of a sentence $u$ is computed as in (17):

$$y\_pred\_sent_u = \mathrm{avg}(y\_pred_u) = \frac{\sum_{w_i \in u} y_i}{|u|} \qquad (17)$$

where $w_i$ is a word in the sentence $u$.

After computing the values y_pred_sent for all sentences, we create a vector whose size
equals the total number of sentences in the pair of plagiarism passages. If the y_pred_sent value
of a sentence is higher than a threshold β, the value corresponding to that sentence is 1;
otherwise, it is 0. We select the longest run of 1s as the plagiarism sentences.
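
Equation (17) and the sentence-level threshold can be sketched as follows. The `sentence_ids` list, which maps each word to its sentence, is a hypothetical input, and the default β = 0.1 is the value reported in section 3.2.

```python
def sentence_means(y_pred, sentence_ids):
    """Eq. (17): average the word-level predictions of each sentence, in order of appearance."""
    sums, counts = {}, {}
    for y, sid in zip(y_pred, sentence_ids):
        sums[sid] = sums.get(sid, 0.0) + y
        counts[sid] = counts.get(sid, 0) + 1
    return {sid: sums[sid] / counts[sid] for sid in sums}


def select_sentences(y_pred, sentence_ids, beta=0.1):
    """Return one flag per sentence: 1 if its mean prediction exceeds beta, else 0."""
    means = sentence_means(y_pred, sentence_ids)
    return [1 if mean > beta else 0 for mean in means.values()]


flags = select_sentences([0.02, 0.04, 0.9, 0.8, 0.7, 0.05], [0, 0, 1, 1, 2, 2])
```

The third sentence in this toy input is kept even though one of its words scores low, which is the situation the boundary-trimming step afterwards is meant to handle.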


Figure 4. The architecture of the multi-layer LSTM model at the word-phase (diagram labels: feature vector of max sim, avg, and sim_sent for each word; LSTM layers; Output)


— Remove redundant text 

To obtain the exact plagiarism strings, we consider the leftmost and the rightmost plagiarism
sentences in which the difference between max_threshold and min_threshold is higher than
$t_1$ ($t_1 = 0.4$). The max_threshold and min_threshold of a sentence $u$ are determined by (18) and (19):

$$max\_threshold_u = \max_{w_j \in u} y_j \qquad (18)$$

$$min\_threshold_u = \min_{w_j \in u} y_j \qquad (19)$$

where $w_j$ is a word in the sentence $u$.
Such sentences have one part inside and the remaining part outside the plagiarism passage.
The outside part is on the left (orient = 1) if the sentence is on the left of the plagiarism sentences, or on the
right (orient = 2) if the sentence is on the right of the plagiarism sentences. If the previous step yields
only one sentence, the outside part lies at both ends (orient = 3) of the sentence. Analyzing the output
vector y_pred of the LSTM model, we found that the predicted values $y_i$ of the inside words
are much higher than those of the outside words.
Algorithm 1 is used to cut off the redundant text from these sentences. The idea of this algorithm is:
given a threshold α, find the longest run of words at the outer end of the leftmost and rightmost sentences
whose predicted values y_pred do not exceed α. We define the left and right positions as the first and last
words of the exact plagiarism strings, respectively. The algorithm receives the following parameters as inputs:
—  y_d: the predicted vector of the sentence, $y\_d = (y\_d_1, y\_d_2, \ldots, y\_d_t)$, where $t$ is the number of words
in the sentence.
— orient: determines whether the intersection position lies on the left (orient = 1), on the right (orient = 2),
or on both sides (orient = 3) of the boundary sentences.


Algorithm 1: Intersection position determination

Input: y_d, orient
 1: # orient = 1: left; orient = 2: right; orient = 3: both
 2: pos_left = 0; pos_right = length(y_d) - 1
 3: α = min(y_d) + (max(y_d) - min(y_d))/2
 4: if orient = 1 or orient = 3 then
 5:     for i = 0 to length(y_d) - 1 do
 6:         if y_d[i] > α then
 7:             pos_left = i
 8:             break
 9: if orient = 2 or orient = 3 then
10:     for i = length(y_d) - 1 downto 0 do
11:         if y_d[i] > α then
12:             pos_right = i
13:             break
Output: pos_left, pos_right


We initialize the left and right positions with the first and last points, respectively (line 2). The
threshold α is the midpoint between the maximum and minimum of the y_d vector (line 3). We determine the
left (line 4) and right (line 9) positions based on the orient value. For each direction, we scan the points
(lines 5 and 10) and take the first point whose predicted value is higher than the threshold α (lines 7 and 12).
These points are the results of the algorithm.
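
A direct Python transcription of Algorithm 1 is given below as a sketch. The function name and the error raised on an empty input are our additions; the logic follows the pseudocode step by step.

```python
def intersection_position(y_d, orient):
    """Return (pos_left, pos_right) for one boundary sentence.

    y_d: word-level predictions of the sentence.
    orient: 1 = trim on the left, 2 = trim on the right, 3 = trim on both sides.
    """
    if not y_d:
        raise ValueError("y_d must contain at least one prediction")
    pos_left, pos_right = 0, len(y_d) - 1
    # Threshold halfway between the smallest and largest prediction.
    alpha = min(y_d) + (max(y_d) - min(y_d)) / 2
    if orient in (1, 3):
        for i, y in enumerate(y_d):
            if y > alpha:
                pos_left = i
                break
    if orient in (2, 3):
        for i in range(len(y_d) - 1, -1, -1):
            if y_d[i] > alpha:
                pos_right = i
                break
    return pos_left, pos_right


y_d = [0.1, 0.2, 0.9, 0.8, 0.95, 0.15]
both = intersection_position(y_d, orient=3)
```

Because α is derived from the sentence itself, the cut adapts to each boundary sentence rather than relying on a global constant.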


3. EXPERIMENT RESULTS AND DISCUSSION 

In our experiments, we use the PAN 2013 text alignment training corpus [21] to train the system.
This corpus is also the training corpus used in the PAN 2014 competition. The PAN 2013 corpus consists of
1000 no obfuscation, 1000 random obfuscation, 1000 translation obfuscation, and 1185 summary obfuscation
pairs of documents. Normally, this corpus is too small for training a deep learning model. Our experiments
show that combining hand-crafted features with the LSTM model is a good
solution to this problem. To compare our system performance with state-of-the-art research on this task, we
use the PAN 2014 text alignment test corpus [22] to evaluate the system.


3.1. Evaluation metrics 

Our system was evaluated using a tool provided by PAN to measure system performance.
The four measures used in PAN are macro-averaged precision, recall, plagdet, and granularity. They are
computed as follows.

Let $S$ be the set of all plagiarism cases, $R$ the set of all detections reported by the system, $s \in S$ a
plagiarism case, and $r \in R$ a detection. The macro-averaged precision and
recall are defined by:

$$\mathrm{prec}(S,R) = \frac{1}{|R|} \sum_{r \in R} \frac{\left|\bigcup_{s \in S} (s \sqcap r)\right|}{|r|} \qquad (20)$$

$$\mathrm{rec}(S,R) = \frac{1}{|S|} \sum_{s \in S} \frac{\left|\bigcup_{r \in R} (s \sqcap r)\right|}{|s|} \qquad (21)$$

where $s \sqcap r$ denotes the part of $s$ that is detected by $r$ (empty if $r$ does not detect $s$).


The detection granularity of $R$ under $S$ indicates whether each plagiarism case $s \in S$ is detected as a
whole or in several pieces. It is calculated as:

$$\mathrm{gran}(S,R) = \frac{1}{|S_R|} \sum_{s \in S_R} |R_s| \qquad (22)$$

where $S_R \subseteq S$ are the cases detected by detections in $R$, and $R_s \subseteq R$ are the detections of a given $s$.
Plagdet is the overall score of the system, which is calculated as:

$$\mathrm{plagdet}(S,R) = \frac{2 \times \mathrm{prec} \times \mathrm{rec}}{\mathrm{prec} + \mathrm{rec}} \times \frac{1}{\log_2(1 + \mathrm{gran}(S,R))} \qquad (23)$$
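
The four measures can be illustrated with a simplified sketch in which every plagiarism case and every detection is a non-empty set of character positions in a single document. This representation, the function names, and the choice to return granularity 1 when nothing is detected are assumptions made for illustration; the official PAN tool works on passage pairs across both documents.

```python
import math


def _covered(case, others):
    """Positions of `case` that overlap at least one set in `others`."""
    hit = set()
    for o in others:
        hit |= case & o
    return hit


def precision(S, R):
    """Eq. (20): mean fraction of each detection that lies inside some true case."""
    return sum(len(_covered(r, S)) / len(r) for r in R) / len(R) if R else 0.0


def recall(S, R):
    """Eq. (21): mean fraction of each true case that is covered by detections."""
    return sum(len(_covered(s, R)) / len(s) for s in S) / len(S) if S else 0.0


def granularity(S, R):
    """Eq. (22): average number of detections per detected case."""
    counts = [sum(1 for r in R if s & r) for s in S]
    detected = [c for c in counts if c > 0]
    # Assumption: granularity is 1 when no case is detected at all.
    return sum(detected) / len(detected) if detected else 1.0


def plagdet(S, R):
    """Eq. (23): F1 score discounted by the logarithm of the granularity."""
    p, r = precision(S, R), recall(S, R)
    if p + r == 0:
        return 0.0
    f1 = 2 * p * r / (p + r)
    return f1 / math.log2(1 + granularity(S, R))


S = [set(range(10))]                       # one true case of 10 characters
R = [set(range(5)), set(range(5, 10))]     # detected in two pieces
score = plagdet(S, R)
```

In this toy case precision and recall are both perfect, yet the score drops below 1 because the single case was reported as two fragments.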


3.2. Experimental results and analysis 

Several tests have been carried out to choose the best configuration for our system. We performed
experiments for each phase to optimize the parameters of the system. Feature vectors extracted from pairs of
documents in the PAN 2013 training corpus are passed to the multi-layer LSTM model during the training
process. We chose binary_crossentropy as the loss function since the model is a binary classification model.
The threshold θ, which is used to select sentences in the passage-phase, is set to 0.1. To choose the
value k (mentioned in section 2.2.2) for extending plagiarism passages, we initialize k to 1 and
increase it until the system reaches the highest recall. Our experiments showed that
the best value of k depends on the length of the plagiarism passages, as shown in Table 1.

At the word-phase, instead of thresholding each word, we apply the threshold
β (β = 0.1) to y_pred_sent. From the LSTM output we generate an array whose size is equal to the number of
sentences: an element is 1 if the corresponding y_pred_sent is higher than β, and 0 otherwise. Then we
select the continuous string with the highest predicted value. Table 2 shows the accuracy and loss values in the
LSTM training phase with the four datasets in PAN 2013.

To evaluate the effectiveness of our proposed features, we carried out experiments using subsets of the
features instead of all features, with pairs of documents from the PAN 2014 test corpus as input. Figure 5 shows
the effect of these features at the word-phase on the system output. The three pairs of panels, Figures 5(a) to
5(f), show the prediction results y_pred and the final results using 1, 2, and 3 features, respectively. In these
figures, the blue line shows the predicted result, and the red line shows the average predicted value by sentence.
The green line separates the suspicious and source passages, and the black line shows the range of the selected
plagiarism passages. The evaluation results indicate that all the proposed features are useful for both literal
plagiarism and intelligent plagiarism.


Table 1. The dynamic parameters for extending passage

Plagiarism passage's length   k
>6 sentences                  1
>3 sentences                  2
>2 sentences                  3
1 sentence                    4


Table 2. Accuracy and loss values of the training phase

PAN 2013 training corpus   Sentence level        Word level
                           Accuracy   Loss       Accuracy   Loss
None Obfuscation           0.9925     0.0068     0.9808     0.0161
Random Obfuscation         0.9727     0.0814     0.9303     0.1909
Translate Obfuscation      0.9707     0.0748     0.9443     0.1229
Summary Obfuscation        -          -          0.9201     0.2096

Figure 5. Effects of selecting different features at word-phase on the plagiarism passage: (a) using one feature, 
wsim(P,Q); (b) output result when using wsim(P,Q); (c) using two features, wsim(P,Q) and wavg(P,Q); 
(d) output result when using wsim(P,Q) and wavg(P,Q); (e) using three features, wsim(P,Q), wavg(P,Q), 
and wsent(P,Q); and (f) output result when using wsim(P,Q), wavg(P,Q), and wsent(P,Q) 

Table 3 compares the performance of our system with existing research on this task, using PAN 
2014 as the test set. Our system shows a clear improvement over the other systems: it achieves the highest 
recall (94.48%) and plagdet (94.26%), which indicates that it detects more plagiarism cases than the others. The results suggest that 
combining the proposed feature extraction techniques with our LSTM models is a promising solution for 
detecting intelligent plagiarism, in which the same content can be expressed in different ways and with 
different words, even when only a small training corpus is available. 


Table 3. Performance comparison with state-of-the-art approaches 


Team                                             F-measure (%)  Plagdet (%)  Prec (%)  Rec (%)  Gran
Our system                                       94.26          94.26        94.04     94.48    1.00000
Palkovskii and Belov [23]                        90.80          90.78        92.76     88.92    1.00027
Alaa Saleh Altheneyan et al. (PlagLinSVM) [11]   90.15          90.01        89.75     90.55    1.00210
Oberreuter and Eiselt [24]                       89.30          89.27        87.17     91.54    1.00051
Sanchez-Perez et al. [25]                        89.21          89.20        86.61     91.98    1.00026
Glinos [26]                                      89.89          88.77        96.01     84.51    1.01761
Alaa Saleh Altheneyan et al. (PlagRbfSVM) [11]   88.40          88.27        85.52     91.49    1.00209
Shrestha et al. [27]                             87.05          86.81        84.42     89.84    1.00381
Gross and Modaresi [28]                          86.84          85.50        92.52     81.82    1.02187
Rodriguez Torrejon and Martin Ramos [29]         84.87          84.87        90.03     80.27    1.00000


* The best results are highlighted in bold. 


When analyzing our system output, we found that most of the incorrect results are due to the 
following situations: 

—  Sentential redundancy: This situation occurs when the sentence near the plagiarism passage is 
semantically related to the plagiarism passage. In that case, the system often includes it in the 
plagiarism passage. 

— Word missing or redundancy: This situation occurs when only a part of the sentence is in the plagiarism 
passage. Because the pre-processing step removes stopwords from the input documents, we need to 
recover these stopwords from the original documents when restoring the original text from the output of 
the word-phase. Redundant or missing stopwords may then appear at the beginning and the end 
of the recovered passage. 

These problems will be considered in our future work. 


4. CONCLUSION 

This paper has proposed an approach that combines feature extraction techniques with a two-phase plagiarism 
detection system based on multi-layer LSTM networks to determine plagiarism strings between two 
documents. The key to the approach is selecting appropriate features for both word-matching and 
semantic-based plagiarism. In addition, building on prior research on measuring the similarity between two 
sentences is an essential factor in distinguishing sentences inside plagiarism passages from those outside. 
The proposed method was evaluated on the PAN 2014 text alignment corpus with widely accepted 
evaluation metrics: precision, recall, and plagdet. It achieves the best recall and plagdet, and the 
second-best precision, compared to state-of-the-art systems. In future work, we plan to find a method to 
automatically choose optimal parameters for our system. We will also investigate methods to solve the 
redundancy problem in the system's output, mentioned in section 3.2. 